ABSTRACT

Given a discretization stencil, partitioning the problem domain is an important first 
step for the efficient solution of partial differential equations on multiple processor 
systems. We derive partitions that minimize interprocessor communication when the 
number of processors is known a priori and each domain partition is assigned to a different 
processor. Our partitioning technique uses the stencil structure to select appropriate 
partition shapes. For square problem domains, we show that non-standard partitions 
(e.g., hexagons) are frequently preferable to the standard square partitions for a variety of 
commonly used stencils. We conclude with a formalization of the relationship between 
partition shape, stencil structure, and architecture, allowing selection of optimal partitions 
for a variety of parallel systems.

1. Introduction

Problem transformation has long been among the most successful solution paradigms. As 
an example, consider the solution of elliptic partial differential equations [Orte85]. Given some 
planar region R , the classical central difference technique covers the region R with a rectangular 
grid and replaces the derivatives at each grid point with central differences. The resulting system 
of linear equations is then amenable to solution via a variety of efficient algorithms. This 
transformation, from partial differential equation to linear system, makes the solution both 
feasible and attractive. Within this framework there remain several alternatives, both in the 
choice of discretization stencil (e.g., 5-point or 9-point) and the linear system solver (e.g., direct 
or iterative), and the most appropriate choices depend on the problem. 

When one considers parallel solution of partial differential equations, an additional 
paradigm, problem domain decomposition [Voig85], arises. If multiple processors are to 
cooperate, each solving the linear equations on a portion of the grid, the selection of grid 
partitions and their assignment to processors are crucial to good performance. 

In this paper, we consider the parallel solution of elliptic partial differential equations over a 
planar region, using both shared memory and message passing architectures. Historically, only 
rectangular partitions of the discretization grid have been assigned to processors, primarily 
because the resulting data structures are regular. However, triangles, squares (a special case of 
rectangles), and hexagons also tessellate the plane. The effects of these partitions on
interprocessor communication and their relation to the discretization stencil are investigated.
Because partitions like hexagons have a higher area to perimeter ratio than rectangles and 
potentially less interpartition communication, there is incentive to investigate their attributes. 

Our results show that the efficiency of the parallel solution depends on the partitioning of 
the discretization grid, its associated stencil, and the underlying architecture. Observing that the
amounts of required computation and communication are functions of a partition’s area and
perimeter, respectively, we compare the performance of a variety of associated stencil/partition 
pairs on both message passing and shared memory architectures. However, we begin with a 
survey of related work and a formal specification of the problem. 

1.1. Related Work 

In a study of hypercube performance, Fox and Otto [Fox84] recently noted that the 
efficiency of a parallel algorithm is determined not by the amount of communication but by the
ratio of communication to calculation. In their study, they considered the solution of Laplace’s 
equation over a square region using a 5-point discretization stencil. Their partitioning placed 
squares of grid points on each node of the hypercube, using only nearest neighbor 
communication. This choice of partitioning has a lower communication to computation ratio 
than the natural alternative, partitioning the grid into an equal number of rectangular strips. 

Vrsalovic, et al. [Vrsa85] have also considered the solution of Poisson’s equation over a 
square region using a 5-point discretization stencil. Unlike Fox and Otto, they tested triangular, 
square, and hexagonal partitions. Their study used the ratio of processing time to data access 
time as one performance metric when comparing the speedup of different partitions on a general 
class of multiprocessor systems. Their hypothetical multiprocessor systems were assumed to 
have both local memory attached to each processor and global memories accessible via an 
interconnection network. Of the three partitions, hexagonal decomposition produced the largest 
speedup. 

In an experimental study, Saltz, et al. [Salt86] considered solution of the heat equation using 
successive over-relaxation (SOR) on an Intel iPSC [Ratt85]. Rectangular strips and squares were
used as grid partitions. They observed that the Intel iPSC’s high startup costs for message
transmission often favored decreasing the number of messages sent, even if that meant sending
more bytes of data. Hence, partitions of rectangular strips were often more efficient than square
partitions.

Superficially, these results by Fox and Otto, Vrsalovic, et al., and Saltz, et al. seem
mutually contradictory, each favoring a different partition shape. However, these studies
considered only a small portion of the possible parameter space of stencils, partitionings, and 
architectures. Moreover, their underlying assumptions differ. This paper presents a formal 
method for analyzing stencil/partition/architecture triplets and applies this method to a variety 
of these triplets. Section 2 begins by computing the total number of points in a partition versus 
the number of points that must be communicated for several common stencils using each of the 
rectangular, square, triangular, and hexagonal partitions. In section 3, these results are used to 
determine those stencil/partition pairs that maximize the ratio of computation to 
communication. Finally, section 4 compares the performance of an algorithm for solving 
Laplace’s equation over a square region using different stencil/partition pairs on both shared 
memory and message passing architectures. 

2. Communication Costs for Selected Stencil/Grid Partition Pairs 

Elliptic partial differential equations, particularly the Laplace and Poisson equations, have 
long been used as test vehicles for new solution algorithms and parallel architectures. 
Consequently, our study is based on the following problem formulation. 

The Problem: Consider an elliptic partial differential equation with Dirichlet
boundary conditions on some square region $R$. If $R$ is discretized to
contain $N = n^2$ points, we wish to solve the resulting linear system
using a point Jacobi iterative solver on a parallel processor
containing $p$ processors (PEs), where $p < N$.

One interesting question immediately arises. Suppose the grid were divided with each
partition placed in a different PE and that each PE used the point Jacobi iterative solution
technique.¹ In this scenario, each PE repeatedly updates its partition of grid points and sends
values associated with its partition boundary to logically adjacent partitions. What partition 
structure would maximize the ratio of computation to communication? One immediately 
observes that 

• computation is a function of a partition’s area, 

• communication is a function of a partition’s perimeter, and 

• the partition’s perimeter that must be sent to other partitions is a function of the stencil. 

As an example, Figure 2.1 illustrates square partitions with a 5-point stencil. Each partition
communicates with four neighboring partitions, and the amount of data transferred is directly 
proportional to the perimeter of the partition. Although convergence checking for an iterative 
scheme also involves communication, the amount and cost of this communication is independent 
of stencil type and partition shape and will not be considered. (It is interesting to note that the 
communication required for the inner products of the conjugate gradient method is also 
independent of stencil type and partition shape.) 
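
The scenario above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the grid is split into `p` horizontal strips (the simplest partition), `p` divides `n`, and the function names `jacobi_sweep`, `strip_sweep`, and `demo` are hypothetical. Each strip updates its own rows from a private copy that includes one halo row per neighbor, standing in for the values a PE would receive.

```python
def jacobi_sweep(u):
    """One sequential point Jacobi sweep (5-point stencil) over the interior."""
    n = len(u)
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
    return new


def strip_sweep(u, p):
    """The same sweep with the grid split into p horizontal strips.

    Returns the new grid and the number of values each strip received.
    Assumes p divides n (an assumption of this sketch).
    """
    n = len(u)
    rows = n // p
    new = [row[:] for row in u]
    received = []
    for k in range(p):
        lo, hi = k * rows, (k + 1) * rows
        first, last = max(lo - 1, 0), min(hi + 1, n)
        local = [u[i][:] for i in range(first, last)]  # owned rows plus halo rows
        halo_rows = (lo > 0) + (hi < n)
        received.append(halo_rows * (n - 2))  # boundary columns hold fixed values
        for i in range(max(lo, 1), min(hi, n - 1)):
            li = i - first
            for j in range(1, n - 1):
                new[i][j] = 0.25 * (local[li - 1][j] + local[li + 1][j]
                                    + local[li][j - 1] + local[li][j + 1])
    return new, received


def demo(n=8, p=4, sweeps=5):
    u = [[0.0] * n for _ in range(n)]
    u[0] = [1.0] * n  # Dirichlet data on the north edge
    seq, par = u, u
    received = []
    for _ in range(sweeps):
        seq = jacobi_sweep(seq)
        par, received = strip_sweep(par, p)
    return seq == par, received
```

Calling `demo()` compares the partitioned iterates with the sequential ones and reports how many values each strip needed from its neighbors; an interior strip needs `2 * (n - 2)` values per sweep, while its computation covers `n * n / p` points.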

In the remainder of this section, we analyze the expected amount of data that must be 
transferred between partitions, given possible stencil/partition pairs. In a later section, we will 
consider the influence of parallel architecture on the choice of a stencil/partition pair. 


¹ The iterates generated by our parallel Jacobi method are the same as those generated by the sequential Jacobi
method. We also emphasize that our analysis techniques can be applied to other point iterative solvers (e.g.,
multicolor SOR and conjugate gradient) as well.

Figure 2.1 Square partitions with 5-point stencil


2.1. Five Point Stencil 

Figure 2.1b shows the 5-point stencil and the equations for the unknowns in Laplace’s 
equation that arise from the standard centered difference approximation to the partial 
derivatives. With an iterative solution of these equations (e.g., point Jacobi), the new value 
computed at each grid point depends on the previous values from its north, south, east, and west 
grid point neighbors. 

Using this stencil, we now consider the influence of partition shape on inter-PE 
communication. To ease comparison, we assume each partition contains $n^2/p$ grid points (i.e.,
each PE’s computation is proportional to $n^2/p$).

2.1.1. Rectangular Partitions

Suppose the grid of $n^2$ points were partitioned into horizontal strips, and each strip were
again partitioned into $r$ rectangles; see Figure 2.2a. Assuming all rectangles are of equal size,
each contains $n^2/p$ grid points with sides $n/r$ and $nr/p$. As illustrated in Figure 2.2b, the perimeter
contains $2(n/r + nr/p) - 4$ grid points and all are involved in data transfer. However, the four
corner points in each rectangle involve two (2) data transfers. Therefore, the data transferred
from each interior rectangle is $2(n/r + nr/p)$.

Figure 2.2 Rectangular partitions with 5-point stencil (panels a and b)

To find an optimal value for $r$, the number of rectangles in each horizontal strip, we need only
maximize the ratio of computation to communication

$$\frac{n^2/p}{2\left(\frac{n}{r} + \frac{nr}{p}\right)}$$

in a single PE. Differentiating and setting the derivative equal to zero, we obtain $p = r^2$ or
$r = \sqrt{p}$ as the optimal value of $r$. Therefore, squares are the optimal rectangular partitioning for
the 5-point stencil with a communicating perimeter of $4n/\sqrt{p}$. With the 5-point stencil, this result
has a simple geometric interpretation: of all rectangular partitions, the square maximizes the
area/perimeter ratio.
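
The optimization over $r$ can be checked numerically. The sketch below is illustrative only: the function names `rect_comm` and `best_r` are hypothetical, and the search is restricted to values of `r` that divide `p` so that the strips contain whole rectangles (an assumption of the sketch).

```python
def rect_comm(n, p, r):
    """Values sent by an interior rectangle with sides n/r and n*r/p."""
    return 2 * (n / r + n * r / p)


def best_r(n, p):
    """Search the divisors of p for the r that minimizes communication.

    Because every rectangle holds n*n/p points, minimizing communication
    is the same as maximizing the computation/communication ratio.
    """
    candidates = [r for r in range(1, p + 1) if p % r == 0]
    return min(candidates, key=lambda r: rect_comm(n, p, r))
```

For `n = 256` and `p = 64` the search selects `r = 8`, that is $r = \sqrt{p}$, with a communicating perimeter of 128 values, which equals $4n/\sqrt{p}$.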

Finally, as an interesting special case, note that if $r = 1$, the grid of $n^2$ points is partitioned
into $p$ strips each containing $n^2/p$ points. In this case, there is no communication to the east or
west and $2n - 4$ values ($n - 2$ north and $n - 2$ south) are communicated from each partition.²

2.1.2. Triangular Partitions

To partition an $n \times n$ grid into $p$ triangles we assume $n = 2\sqrt{p}\,l$ and divide the grid into $p/2$
squares with sides $s = 2\sqrt{2}\,l$. Each of these $p/2$ squares will contain $8l^2$ grid points. Each of the
squares is then divided into the two “approximate” triangles shown in Figure 2.3a. Each of the
$p$ triangles contains $4l^2$ grid points and has height $s$ and base $s - 1$.

Now consider the communicating perimeter of the upper triangle in Figure 2.3a, assuming a
5-point stencil. By observation, $s$ values are sent north, $s - 1$ values east, $s$ values south, and 1
value to the west, for a total of $3s$. Note that $s - 2$ of the values transmitted south are used
twice by the receiving triangle. The other triangles are reflections of this case. Because
$n = 2\sqrt{p}\,l$ and $s = 2\sqrt{2}\,l$, the total number of values sent from each triangle is $3\sqrt{2}\,n/\sqrt{p}$.

² The four corner points of the partition are fixed boundary values that need not be transmitted.



2.1.3. Hexagonal Partitions 


Now consider dividing the $n \times n$ grid into $p$ hexagonal partitions. We again assume that
$n = 2\sqrt{p}\,l$, implying each partition has $n^2/p = 4l^2$ grid points. Figure 2.4 shows how this
partitioning can be accomplished. Each hexagon has $l + 1$ grid points at the north and south
edges and $l$ grid points on each of the four remaining sides. The number of grid points in the
upper or lower half of each hexagon is

$$\sum_{i=1}^{l} \left[(l + 1) + 2(i - 1)\right] = 2l^2,$$

for a total of $4l^2$ in each hexagon.

As Figure 2.4 shows, $l + 1$ values must be sent north, $l + 1$ values south, $l$ northeast, $l$
southeast, $l$ southwest, and $l$ northwest, a total of $6l + 2$. Because $l = n/(2\sqrt{p})$, each hexagon must
communicate $3n/\sqrt{p} + 2$ values.
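
The hexagon counts can be verified directly. The following sketch is illustrative; the function names are hypothetical, and it assumes $p$ is a perfect square and $n$ is divisible by $2\sqrt{p}$ so that $l$ is an integer.

```python
import math


def side_parameter(n, p):
    """Return l = n / (2 sqrt(p)); assumes the result is an integer."""
    root = math.isqrt(p)
    assert root * root == p and n % (2 * root) == 0
    return n // (2 * root)


def hexagon_points(l):
    """Grid points in one hexagon: twice the sum over rows of one half."""
    half = sum((l + 1) + 2 * (i - 1) for i in range(1, l + 1))
    return 2 * half


def hexagon_comm_5pt(l):
    """Values sent with the 5-point stencil: north, south, and four diagonal sides."""
    return 2 * (l + 1) + 4 * l
```

With `n = 256` and `p = 64`, `side_parameter` gives `l = 16`, each hexagon holds 1024 points (the same $n^2/p$ as a square partition), and the communicating perimeter is 98 values, compared with 128 for the square.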


2.2. Nine Point Stencil 

The 9-point stencil, shown in Figure 2.5, is a higher order finite difference approximation to 
the partial derivatives than the 5-point stencil discussed earlier. When using this stencil, the 
iteration value computed at each grid point is a function of its north, northeast, east, southeast, 
south, southwest, west, and northwest grid point neighbor values. In this section we examine the 
amount of inter-PE communication for the same partitions discussed earlier and observe the 
change in a partition’s communicating perimeter as the stencil changes.

Figure 2.5 9-point star stencil


2.2.1. Rectangular Partitions 


Figures 2.2 and 2.5 show that the communicating perimeter of rectangular partitions for
the 9-point stencil is nearly the same as the communicating perimeter for the 5-point stencil.
Only the four corner points of a partition are each involved in an additional communication. As
before, squares are the optimal rectangular partitioning with a communicating perimeter of
$4n/\sqrt{p} + 4$. Because there is no communication to the left or right, rectangular strips ($r = 1$) have
the same communicating perimeter for both the 5-point and 9-point stencils.

2.2.2. Triangular Partitions

The dashed lines between grid points in Figure 2.6 highlight the additional communications 
required for triangular partitions when using the 9-point stencil rather than the 5-point stencil. 
The solid lines between grid points are the communicating perimeter for the 5-point stencil (3s). 
The 9-point stencil requires the following additional communications: 1 to the northeast, 1 to
the southeast, 1 to the northwest, 1 to the southwest, and $s - 2$ to the south. This yields a total
communicating “perimeter” for an interior triangular partition with the 9-point stencil of
$4s + 2$ or $4\sqrt{2}\,n/\sqrt{p} + 2$. “Perimeter” is perhaps a misnomer here, for the perimeter of points along
the diagonal in Figure 2.6 is “two deep” for the 9-point stencil.



Figure 2.6 Triangular partitions with 9-point star stencil

2.2.3. Hexagonal Partitions

The dashed lines in Figure 2.7 illustrate the additional communications required with
hexagonal partitions when using the 9-point stencil instead of the 5-point stencil. The solid lines
of Figure 2.7 correspond to the communicating perimeter of the 5-point stencil, shown to be
$6l + 2$ in section 2.1.3. The 9-point stencil requires $l$ communications to each of the northeast,
southeast, southwest, and northwest in addition to those for the 5-point stencil. This gives a
total communicating perimeter, for interior hexagonal partitions, of

$$10l + 2 = \frac{5n}{\sqrt{p}} + 2,$$

where $l = n/(2\sqrt{p})$.

Figure 2.7 Hexagonal partitions with 9-point star stencil

Note that the communicating “perimeter” is depth 2 along four of the six edges.
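
To see how the stencil changes the comparison between shapes, the closed forms from Sections 2.1 and 2.2 can be evaluated side by side. This is a minimal sketch; the function names and the string labels for the stencils are hypothetical, and only the two stencils analyzed so far are handled.

```python
import math


def square_comm(n, p, stencil):
    """Communicating perimeter of an interior square partition."""
    if stencil not in ("5-point", "9-point star"):
        raise ValueError("only the two stencils analyzed above are handled")
    side = n / math.sqrt(p)
    return 4 * side + (4 if stencil == "9-point star" else 0)


def hexagon_comm(n, p, stencil):
    """Communicating perimeter of an interior hexagonal partition."""
    if stencil not in ("5-point", "9-point star"):
        raise ValueError("only the two stencils analyzed above are handled")
    l = n / (2 * math.sqrt(p))
    return (10 if stencil == "9-point star" else 6) * l + 2
```

For `n = 256` and `p = 64`, the hexagon sends fewer values than the square with the 5-point stencil (98 versus 128) but more with the 9-point star stencil (162 versus 132), because four of its six sides are then two points deep.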


2.3. Other Stencils 

Many stencils other than the 5-point and 9-point stencils analyzed above are frequently 
used when solving partial differential equations. Figure 2.8 illustrates some of the most common. 
For brevity’s sake, we do not include the analysis of the communication required for their 
associated partitions. However, the results of this analysis are summarized in Table 2. The 
interested reader can verify these results by applying the methods discussed earlier to compute 
the additional grid points involved in data transfer for each of these stencils.

Figure 2.8 Frequently used discretization stencils

2.4. Computation/Communication Ratios

Before summarizing the results of the previous section, we introduce the notation shown in 
Table 1. Using this notation, Table 2 shows relative amounts of computation and 
communication for selected stencil/partition pairs. For simplicity, the effects of boundaries on 
communication have been elided. (Recall that $n^2$ is the number of grid points, and $p$ is the
number of processors.) Table 2 also includes one quantity not discussed earlier, parallel 
communication, the amount of data transfer if partition sides can communicate in parallel. This 
parallel communication will later allow us to determine if the optimal stencil/partition changes 
when communication to neighboring partitions can be done in parallel. 

The entries of most interest in Table 2 are the ratio of computation to communication (R) 
and the ratio of computation to parallel communication (PR). Table 3 illustrates the relative 
magnitude of these quantities for a square grid containing $256 \times 256$ points and a parallel system
with 64 processors. 


Table 1 Static scheduling notation

Quantity   Definition
Comp       the computational complexity of a stencil/partition pair ($n^2/p$)
Comm       communication complexity of a stencil/partition pair
Pcomm      parallel communication complexity of a stencil/partition pair
R          the ratio Comp/Comm
PR         the ratio Comp/Pcomm
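
Table 2 Summary of stencil/partition analysis

          5-point                         7-point                         9-point star                    9-point cross                   13-point

Rectangular strips
Comm:     $2n$                            $2n$                            $2n$                            $4n$                            $4n$
Pcomm:    $n$                             $n$                             $n$                             $2n$                            $2n$
R:        $n/(2p)$                        $n/(2p)$                        $n/(2p)$                        $n/(4p)$                        $n/(4p)$
PR:       $n/p$                           $n/p$                           $n/p$                           $n/(2p)$                        $n/(2p)$

Triangle
Comm:     $3\sqrt{2}n/\sqrt{p}$           $3\sqrt{2}n/\sqrt{p} + 2$       $4\sqrt{2}n/\sqrt{p} + 2$       $6\sqrt{2}n/\sqrt{p}$           $\approx 6\sqrt{2}n/\sqrt{p}$
Pcomm:    $\sqrt{2}n/\sqrt{p}$            $\approx 2\sqrt{2}n/\sqrt{p}$   $\approx 2\sqrt{2}n/\sqrt{p}$   $2\sqrt{2}n/\sqrt{p}$           $\approx 2\sqrt{2}n/\sqrt{p}$
R:        $n/(3\sqrt{2p})$                $\approx n/(3\sqrt{2p})$        $\approx n/(4\sqrt{2p})$        $n/(6\sqrt{2p})$                $\approx n/(6\sqrt{2p})$
PR:       $n/\sqrt{2p}$                   $\approx n/(2\sqrt{2p})$        $\approx n/(2\sqrt{2p})$        $n/(2\sqrt{2p})$                $\approx n/(2\sqrt{2p})$

Square
Comm:     $4n/\sqrt{p}$                   $4n/\sqrt{p}$                   $4n/\sqrt{p} + 4$               $8n/\sqrt{p}$                   $\approx 8n/\sqrt{p}$
Pcomm:    $n/\sqrt{p}$                    $n/\sqrt{p}$                    $n/\sqrt{p}$                    $2n/\sqrt{p}$                   $2n/\sqrt{p}$
R:        $n/(4\sqrt{p})$                 $n/(4\sqrt{p})$                 $\approx n/(4\sqrt{p})$         $n/(8\sqrt{p})$                 $\approx n/(8\sqrt{p})$
PR:       $n/\sqrt{p}$                    $n/\sqrt{p}$                    $n/\sqrt{p}$                    $n/(2\sqrt{p})$                 $n/(2\sqrt{p})$

Hexagon
Comm:     $3n/\sqrt{p} + 2$               $4n/\sqrt{p} + 2$               $5n/\sqrt{p} + 2$               $\approx 6n/\sqrt{p}$           $\approx 6n/\sqrt{p}$
Pcomm:    $\approx n/(2\sqrt{p})$         $\approx n/\sqrt{p}$            $\approx n/\sqrt{p}$            $\approx n/\sqrt{p}$            $\approx n/\sqrt{p}$
R:        $\approx n/(3\sqrt{p})$         $\approx n/(4\sqrt{p})$         $\approx n/(5\sqrt{p})$         $\approx n/(6\sqrt{p})$         $\approx n/(6\sqrt{p})$
PR:       $\approx 2n/\sqrt{p}$           $\approx n/\sqrt{p}$            $\approx n/\sqrt{p}$            $\approx n/\sqrt{p}$            $\approx n/\sqrt{p}$

NOTE: Comp = $n^2/p$ is used in computing R and PR in all cases.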

An inspection of Table 3 shows that hexagonal partitions yield the highest ratio of
computation to serial communication, except for the 9-point star stencil, where squares are
better. However, if one assumes the inter-partition communication can be done in parallel (i.e.,
all edges of a partition can be transmitted in parallel), hexagons yield the highest ratio in all
cases. With parallel communication, the improvement obtained with hexagons is even greater
(e.g., $R_{hexagon}/R_{square} = 1.33$ for the 5-point stencil but $PR_{hexagon}/PR_{square} = 2$).

The patterns in Table 3 suggest there is some formal relation between partitions and 
stencils, with certain combinations preferred. In the next section we develop techniques for 
selecting optimal partition/stencil combinations. 


Table 3 Ratio of computation to communication ($n = 256$ and $p = 64$)

Partition type     5-point   7-point   9-point star   9-point cross   13-point

Rectangle   R:     2         2         2              1               1
            PR:    4         4         4              2               2

Triangle    R:     7.5       7.5       5.65           3.75            3.75
            PR:    22.5      11.3      11.3           11.3            11.3

Square      R:     8         8         8              4               4
            PR:    32        32        32             16              16

Hexagon     R:     10.66     8         6.4            5.3             5.3
            PR:    64        32        32             32              32
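
The entries of Table 3 can be approximated from leading-order communication costs. The sketch below is an illustration, not the authors' computation: the coefficient tables `COMM` and `PCOMM` and the function names are hypothetical, the coefficients are the leading terms implied by Sections 2.1 and 2.2 and by the ratios in Table 3, and all additive constants are dropped (an assumption of this sketch).

```python
import math

STENCILS = ["5-point", "7-point", "9-point star", "9-point cross", "13-point"]
R2 = math.sqrt(2)

# Leading coefficients c with Comm ~ c * n / sqrt(p); for strips Comm ~ c * n.
COMM = {
    "strips":   [2, 2, 2, 4, 4],
    "triangle": [3 * R2, 3 * R2, 4 * R2, 6 * R2, 6 * R2],
    "square":   [4, 4, 4, 8, 8],
    "hexagon":  [3, 4, 5, 6, 6],
}
PCOMM = {
    "strips":   [1, 1, 1, 2, 2],
    "triangle": [R2, 2 * R2, 2 * R2, 2 * R2, 2 * R2],
    "square":   [1, 1, 1, 2, 2],
    "hexagon":  [0.5, 1, 1, 1, 1],
}


def ratios(n, p, coeffs, partition):
    """Comp / Comm for each stencil, with Comp = n^2 / p."""
    comp = n * n / p
    unit = n if partition == "strips" else n / math.sqrt(p)
    return [comp / (c * unit) for c in coeffs]


def table3(n=256, p=64):
    return {part: {"R": ratios(n, p, COMM[part], part),
                   "PR": ratios(n, p, PCOMM[part], part)}
            for part in COMM}
```

`table3()` returns one `R` and one `PR` list per partition, ordered as in `STENCILS`. Because constants are ignored, the results match the square, strip, and hexagon rows of Table 3 closely and the triangle row only to within rounding.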

3. Determining Optimal Stencil/Partition Pairs

Using the following definition, a partition can be categorized with respect to a given stencil 
by the number of partition perimeters that must be communicated. 


Definition: A partition is a $k$-partition with respect to stencil $S$ if $k$ perimeters are
communicated when stencil $S$ is used.

For example, the square is a 1-partition with respect to the 5-point, 7-point, and 9-point star 
stencils but is a 2-partition with respect to the 9-point cross and 13-point stencils. The hexagon 
is a 1-partition for the 5-point and a 2-partition with respect to the 9-point cross and 13-point 
stencils. 


Moreover, the value of $k$ can be a fraction. The hexagon, for example, is a $1\tfrac{2}{6}$-partition for
the 7-point stencil and a $1\tfrac{4}{6}$-partition with the 9-point star stencil. Why? Because only some
sides of the hexagon are involved in multiple data transfers. This categorization of partitions
with respect to stencils provides a ranking mechanism for stencil/partition pairs. Hence, one can
determine those stencils where $l$-partition hexagons are preferable to $k$-partition squares.

When communication from a partition to each of its neighboring partitions is done serially,
the communicating perimeter for square $k$-partitions is nearly $4kn/\sqrt{p}$, and the corresponding ratio
of computation to serial communication is $n/(4k\sqrt{p})$. The communicating perimeter for hexagonal
$l$-partitions is approximately $3ln/\sqrt{p}$ and the corresponding ratio of computation to serial
communication is $n/(3l\sqrt{p})$. Clearly, an $l$-partition hexagon yields a higher ratio when

$$\frac{n}{3l\sqrt{p}} > \frac{n}{4k\sqrt{p}}$$

or when

$$k > \frac{3}{4}\,l. \qquad (3.1)$$


If one adopts parallel rather than serial communication, the communicating perimeter for
square $k$-partitions is, except for a small constant, $kn/\sqrt{p}$, and the ratio of computation to parallel
communication is $n/(k\sqrt{p})$. Similarly, the communicating perimeter for hexagonal $l$-partitions is
$ln/(2\sqrt{p})$ and the corresponding ratio of computation to parallel communication is $2n/(l\sqrt{p})$. With
parallel communication, $l$-hexagons are preferable to $k$-squares when

$$\frac{2n}{l\sqrt{p}} > \frac{n}{k\sqrt{p}}$$

or

$$k > \frac{l}{2}. \qquad (3.2)$$


Using inequalities (3.1) and (3.2), Table 4 shows optimal stencil/partition pairs, based on 
the maximum ratio of computation to communication. Table 4 shows that square partitions are 
better than hexagons in only one of the 10 cases. Note that the $k$- and $l$-values for parallel
communication in Table 4 were obtained by rounding the fractional values for serial
communication up to the next largest integer (i.e., a parallel communication of a fractional
number of perimeters between one and two requires two transmissions). Based solely on Table 4,
hexagonal partitions are superior to square partitions because they minimize the interpartition
data transfer.³ Similarly, triangles are clearly inferior.

³ As we shall see, the underlying parallel architecture also influences the choice of partition shape.

Table 4 Comparison of Square and Hexagonal Partitions

Stencil         Square k-value       Hexagon l-value         Optimal partition                            Optimal partition
                (serial, parallel)   (serial, parallel)      serial: $k > \frac{3}{4}l$                   parallel: $k > \frac{l}{2}$

5-point         (1, 1)               (1, 1)                  hexagon ($1 > \frac{3}{4}$)                  hexagon ($1 > \frac{1}{2}$)
7-point         (1, 1)               ($1\tfrac{2}{6}$, 2)    equal ($1 = \frac{3}{4}\cdot\frac{4}{3}$)    equal ($1 = \frac{1}{2}\cdot 2$)
9-point star    (1, 1)               ($1\tfrac{4}{6}$, 2)    square ($1 < \frac{3}{4}\cdot\frac{5}{3}$)   equal ($1 = \frac{1}{2}\cdot 2$)
9-point cross   (2, 2)               (2, 2)                  hexagon ($2 > \frac{3}{4}\cdot 2$)           hexagon ($2 > 1$)
13-point        (2, 2)               (2, 2)                  hexagon ($2 > \frac{3}{4}\cdot 2$)           hexagon ($2 > 1$)
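
Inequalities (3.1) and (3.2) amount to a small decision rule, sketched below. The names `SERIAL`, `verdict`, and `compare` are hypothetical. The fractional hexagon values 4/3 and 5/3 are assumptions of this sketch, obtained by dividing the hexagon costs $4n/\sqrt{p}$ and $5n/\sqrt{p}$ by the single-perimeter cost $3n/\sqrt{p}$; exact rational arithmetic avoids rounding in the equality cases.

```python
import math
from fractions import Fraction

# Serial perimeter counts per stencil: (k for squares, l for hexagons).
SERIAL = {
    "5-point":       (Fraction(1), Fraction(1)),
    "7-point":       (Fraction(1), Fraction(4, 3)),
    "9-point star":  (Fraction(1), Fraction(5, 3)),
    "9-point cross": (Fraction(2), Fraction(2)),
    "13-point":      (Fraction(2), Fraction(2)),
}


def verdict(k, bound):
    """Hexagon wins when k exceeds the bound, square when it falls short."""
    if k > bound:
        return "hexagon"
    if k < bound:
        return "square"
    return "equal"


def compare(stencil):
    k, l = SERIAL[stencil]
    serial = verdict(k, Fraction(3, 4) * l)      # inequality (3.1)
    kp, lp = math.ceil(k), math.ceil(l)          # round up for parallel transfers
    parallel = verdict(kp, Fraction(lp, 2))      # inequality (3.2)
    return serial, parallel
```

Evaluating `compare` for the five stencils yields one "square" verdict out of ten, for the 9-point star stencil with serial communication.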


4. Architecture and the Performance of Stencil/Partition Pairs 

Our previous analysis did not include architectural considerations, save for the inclusion of 
results for both serial and parallel communication. However, the stencil and grid partition 
cannot be divorced from the processor connectivity of a message passing architecture (e.g., square 
or hexagonal grid) or the storage schema used in a shared memory multiprocessor. Optimal 
performance can be achieved only via judicious selection of a trio: stencil, partitioning, and 
architecture. 

Deriving expressions for parallel execution times and speed-ups for a 
stencil/partition/architecture trio requires a model of execution. Our parallel execution time 
model is a variation of one we developed earlier [Reed85] and is similar to the one used by 
Vrsalovic, et al. [Vrsa85]. In this model, the parallel iteration time for evaluating one partition
of grid points is

$$t_{cycle}^{p\text{-}processor} = t_{comp} + t_a + t_w,$$

where $t_{comp}$ is the iteration computation time, $t_a$ is the data access/transfer time, and $t_w$ is the
waiting/synchronization time.

The computation time $t_{comp}$ depends on the partition size and stencil, and is independent of
the architecture except for the time, $T_{fp}$, to execute a floating point operation. Formally, $t_{comp}$ is

$$t_{comp} = \frac{E(S)\,n^2}{p}\,T_{fp},$$

where $E(S)$ is the number of floating point operations required to update the value of a grid
point, given a stencil $S$, $n^2/p$ is the number of grid points in a partition, and $T_{fp}$ is the time for a
single floating point operation.

The speedup obtained using parallel iterations is simply

$$S_p = \frac{t_{cycle}^{uniprocessor}}{t_{cycle}^{p\text{-}processor}}, \qquad (4.1)$$

where the single processor iteration time is just

$$t_{cycle}^{uniprocessor} = E(S)\,n^2\,T_{fp}.$$

Specific values for the speedup depend not only on the trio of stencil, partition, and network 
chosen, but also on the technology constants (e.g., floating point operation time and packet 
transmission time). 
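
The execution time model translates directly into code. The sketch below is illustrative only: the function names are hypothetical, and the access and waiting times `t_a` and `t_w` are taken as given inputs rather than derived from an architecture.

```python
def t_comp(E, n, p, T_fp):
    """Per-iteration computation time for one partition of n*n/p grid points."""
    return E * n * n * T_fp / p


def speedup(E, n, p, T_fp, t_a=0.0, t_w=0.0):
    """Uniprocessor iteration time divided by the p-processor iteration time."""
    t_uni = E * n * n * T_fp
    t_par = t_comp(E, n, p, T_fp) + t_a + t_w
    return t_uni / t_par
```

With no access or waiting time the speedup equals `p`; for example, `speedup(5, 256, 64, 1.0)` gives 64.0, where `E = 5` is an arbitrary illustrative operation count. If the access time equals the computation time, the speedup halves.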

The other components of the execution time model, t a and t w , depend on the particular 
combination of partitioning, stencil, and architecture and are analyzed below. 


4.1. Message Passing Architectures 

Among the competing classes of parallel machines, message passing architectures occupy an 
important niche. The recent emergence of commercial message passing machines (e.g., the Intel
hypercube [Ratt85]) has stimulated great interest in this area.

Each processor in a message passing machine contains a local memory and is connected to a 
(necessarily) small number of other processors. Access to data contained in another processor’s 
memory requires transfer of that data via the interconnection network. Clearly, the performance 
of a stencil/partition pair depends heavily on the performance of the interconnection network of 
the multiprocessor system. Although a plethora of interconnection networks have been proposed 
[Reed83, Witt81], Figure 4.1 shows those networks (meshes) that are directly relevant to iterative 
solution of elliptic partial differential equations. Each interconnection network has an associated 
“natural” partition (e.g., square partitions on a square mesh). 

Consider an interior processor in one of the partition/mesh pairs. During each iteration 
(cycle), two groups of data must cross each communications link, one in each direction from 
neighboring processors. There are several possible interleavings of computation and remote data 
access. These range from a separate request for each communicating “perimeter” grid point 
when it is needed to a request for an entire “side” of the communicating “perimeter” of the 
partition. These requests can, in turn, be either overlapped or non-overlapped with 
computation. Similarly, the hardware support for interprocessor communication must be 
specified. A simple hardware design allows only one link connected to each processor to be active 
at any time, increasing the data transfer time. With additional hardware, each processor link 
can be simultaneously active. 

Each combination of data access patterns and hardware design alternatives leads to an 
implementation with different performance characteristics. Rather than cursorily examine a wide 
variety of alternatives, we have chosen to examine a smaller set in detail. Specifically, we assume 

• communication links are half-duplex (i.e., data can flow along links in only one direction at
a time) and

• processors request and wait for all perimeter values before starting computation.

Currently, these assumptions correspond to all commercial hypercube implementations [Ratt85].

Figure 4.1 Selected interconnection networks

Table 5 Execution time model notation

Quantity            Definition
$d_{ij}(S, P)$      amount of data sent from i to j
$l_{ij}$            number of links between i and j
$P$                 partition
$P_i$               processor i
$Ps$                packet size
$S$                 discretization stencil
$S_p$               speedup
$T_{fp}$            time for a single floating point operation
$t_a^{parallel}$    parallel access time
$t_a^{serial}$      serial access time
$t_{comm}$          time to send a packet across one communication link
$t_{cycle}$         time for one iteration
$t_{forward}$       time (possibly zero) to interrupt an intermediate processor and forward a message
$t_{(i,j)}$         data transmission time from processor i to j
$t_{startup}$       overhead for preparing a communication
$t_w^{serial}$      serial waiting time

Whether the communication is serial or parallel, some processor $P_i$ in the interior of the
network will need data from another processor $P_j$ that is some number of links $l_{ij}$ away. (See
Table 5 for notation.) The amount of data to be transmitted, $d_{ij}(S, P)$, depends on both the
stencil $S$ and the grid partitioning $P$. Ignoring synchronization and queueing delays, the time to
transmit data from $P_i$ to $P_j$, crossing $l_{ij}$ links, is

$$t_{(i,j)} = t_{startup} + \left\lceil \frac{d_{ij}(S, P)}{Ps} \right\rceil l_{ij}\, t_{comm} + l_{ij}\, t_{forward}, \qquad (4.2)$$

where $t_{startup}$ is the fixed overhead for sending data, $t_{comm}$ is the packet transmission time, and
$t_{forward}$ is the message forwarding overhead incurred at intermediate processors. The ceiling
function reflects the redundant communication due to the fixed packet size $Ps$.
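
The effect of the ceiling can be illustrated with a short sketch. The function names are hypothetical, and the way the terms are combined in `transmit_time` is an assumption of this sketch: startup is charged once, every packet pays the transmission time on each link, and forwarding is charged once per link. Only the packet count follows directly from the text.

```python
import math


def packets(d, packet_size):
    """Number of fixed-size packets needed to carry d values."""
    return math.ceil(d / packet_size)


def transmit_time(d, links, packet_size, t_startup, t_comm, t_forward=0.0):
    """Assumed combination of terms; see the caveats in the text above."""
    return (t_startup
            + packets(d, packet_size) * links * t_comm
            + links * t_forward)
```

For instance, with a packet size of 16 values, a side of 16 values fits in one packet while a side of 17 values needs two, so a single extra value doubles the transmission term.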


In general, data destined for other processors will encounter queueing delays, both at their 
origin and at intermediate nodes. The latter is expected, but the former is counter-intuitive. As 
an example of this phenomenon, consider the mapping of hexagonal partitions onto either a 
square or hexagonal mesh. On a square mesh, data from the six sides of the hexagons must exit
via only four connecting links. Even with all links simultaneously active, some data will be
delayed. 

With hexagonal partitions on a hexagonal mesh, each partition edge is directly connected to 
its six neighboring partitions. However, each pair of neighbors must exchange data. Thus, two 
transmission delays are needed on each of the six links before exchanges are complete. If all links 
can simultaneously be active, two transmission delays will suffice to exchange boundary values. 
Conversely, twelve delays will be needed if only one link per processor can be active at any time. 
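
The delay counts above follow from simple arithmetic, sketched here. The function name `exchange_delays` is hypothetical, and applying the same rule to meshes other than the hexagonal one is an extrapolation made for this sketch rather than a result stated in the text.

```python
def exchange_delays(links, all_links_active):
    """Transmission delays for one boundary exchange on half-duplex links.

    Each link must carry data once in each direction, hence two delays per
    link; links proceed concurrently only if the hardware allows it.
    """
    per_link = 2
    return per_link if all_links_active else per_link * links
```

For a processor with six links, `exchange_delays(6, True)` gives 2 and `exchange_delays(6, False)` gives 12, matching the two cases described for the hexagonal mesh.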

There are two general approaches to managing the interpartition communication problem. 
The first relegates management of message passing and the associated queueing of messages for 
available links to system software residing in each processor. With this approach, each partition 
simply passes the data to be delivered to other partitions to the system software. No 
consideration is given to the pattern of communication in time. As an example, each partition 
might successively send boundary values on each of its links, then await receipt of boundary 
values from neighboring partitions. Although this approach is attractive from a programming 
standpoint, it hides the performance issues and may lead to increased contention for 
communication links. 

The second approach requires programming the exchange of partition boundaries in a series 
of phases, each phase corresponding to a particular pattern of communication. In the example of
hexagonal partitions on a hexagonal mesh, discussed above, the communication pattern of 
neighboring partitions would be alternating sends and receives. Sender and receiver would 
cooperate, each expecting the action of the other. This pseudo-SIMD mode of communication 
leads to regular communication patterns with minimal delays. Application of this approach is 
the subject of the next section. 
4.2. Message Passing Analysis

Because the range of partition and network possibilities is so large, we have opted to 
present only the analysis of the 5-point stencil with square and hexagonal partitions on square 
and hexagonal interconnection networks. The triangular partitions were omitted because, as a 
cursory examination of Figure 2.3 shows, they require data transmission to four adjacent 
partitions. Because square partitions also transmit to only four adjacent partitions and have a 
higher ratio of computation to communication, they are always preferable to triangles. The 
analysis for 9-point stencils is similar to that presented below; only the case analysis is more 
complex. 

When partitions are mapped onto an interconnection network, the processors may permit 
communication on only one link or on all. In the following we consider only the serial case; 
similar analysis applies to simultaneous communication on all links. 


We begin with the simplest case: square partitions on a square interconnection network.
Each partition must exchange $n/\sqrt{p}$ values with each of its four neighbors. Because only one link
per processor can be active, we expect the data exchange to require four phases (i.e., time
proportional to $4n/\sqrt{p}$). However, this would require all processors to simultaneously send and
receive. At any given instant, only half the processors can send; the other half must receive.
Hence, eight phases are needed, and the total time for data exchange is

\[
t_a^{serial} + t_w^{serial} = 4\, t_{startup} + 8 \left\lceil \frac{n}{\sqrt{p}\, Ps} \right\rceil t_{comm} .
\]

Four startup costs are needed to initiate message transmissions to neighboring processors.
Because square partitions map directly onto the square mesh, no intermediate node forwarding
costs arise. Because the square mesh can be directly embedded in the hexagonal mesh, the data
exchange delay for square partitions on a hexagonal mesh is identical to that for the square
mesh.^4
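The eight-phase exchange for square partitions can be sketched with a checkerboard coloring of the processors. This is one possible way, assumed here rather than prescribed by the text, to arrange that only half the processors send in any phase. The function names and the mesh without wraparound links are illustrative choices.

```python
from itertools import product

def eight_phase_schedule(r):
    """Return 8 phases of (sender, receiver) pairs for an r x r square mesh."""
    directions = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # north, east, south, west
    phases = []
    for di, dj in directions:
        for parity in (0, 1):
            phase = []
            for i, j in product(range(r), repeat=2):
                ni, nj = i + di, j + dj
                # Only processors of the current color send; mesh edges have no neighbor.
                if (i + j) % 2 == parity and 0 <= ni < r and 0 <= nj < r:
                    phase.append(((i, j), (ni, nj)))
            phases.append(phase)
    return phases

def one_link_per_processor(phase):
    """True if no processor takes part in more than one transfer in this phase."""
    busy = [node for pair in phase for node in pair]
    return len(busy) == len(set(busy))

phases = eight_phase_schedule(4)
```

For each of the four directions, the processors of one color send while the other color receives, and then the roles swap. Every processor therefore uses a single link per phase, and every boundary is sent in both directions after eight phases.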


Like square partitions on a square mesh, hexagonal partitions map directly onto a
hexagonal mesh. Recalling that the north and south sides of a hexagon contain $n/(2\sqrt{p}) + 1$ points,
and the other four sides contain $n/(2\sqrt{p})$ points each, the data exchange delay is

\[
t_a^{serial} + t_w^{serial} = 6\, t_{startup}
  + 4 \left\lceil \frac{\frac{n}{2\sqrt{p}} + 1}{Ps} \right\rceil t_{comm}
  + 8 \left\lceil \frac{\frac{n}{2\sqrt{p}}}{Ps} \right\rceil t_{comm} .
\]

The first ceiling term corresponds to the north/south exchange and requires four phases.
Similarly, the second term represents the exchange of data along the four diagonal connections
and requires eight phases.

^4 This is only true for the 5-point stencil. With the 9-point and other stencils, the distinction
between square and hexagonal meshes is important.

Finally, hexagonal partitions can also be mapped onto a square mesh. Unlike the other
mappings, this one requires data exchange between non-adjacent processors. In this case, we
assume that rows of hexagons are mapped onto corresponding rows of the square mesh. With
this mapping, north/south connections and half the diagonal connections are realized directly.
The remaining diagonal connections require traversal of two links to "turn the corner" in the
square mesh. Hence, the total communication delay due to data exchange is

\[
t_a^{serial} + t_w^{serial} = 6\, t_{startup}
  + 4 \left\lceil \frac{\frac{n}{2\sqrt{p}} + 1}{Ps} \right\rceil t_{comm}
  + 4 \left\lceil \frac{\frac{n}{2\sqrt{p}}}{Ps} \right\rceil t_{comm}
  + 8 \left\lceil \frac{\frac{n}{2\sqrt{p}}}{Ps} \right\rceil t_{comm}
  + 4\, t_{forward} .
\]

The first ceiling term corresponds to the north/south connections and the second to the
directly connected diagonals, each with four phases. The third term represents indirectly connected
diagonals, requiring eight phases. Half these phases require forwarding through intermediate
nodes, hence the four forwarding costs.
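To compare the three mappings numerically, the delays described in this section can be evaluated with a short Python sketch. The function names are hypothetical, the packet size `Ps` is assumed to be expressed in grid values rather than bytes, and the sample parameters (a 1024 X 1024 grid on 16 processors, unit packet time, zero startup and forwarding) are illustrative only.

```python
import math

def packets(values, Ps):
    """Packets needed to carry `values` grid values with Ps values per packet."""
    return math.ceil(values / Ps)

def square_on_square(n, p, Ps, t_startup, t_comm):
    side = n / math.sqrt(p)
    return 4 * t_startup + 8 * packets(side, Ps) * t_comm

def hexagon_on_hexagon(n, p, Ps, t_startup, t_comm):
    diag = n / (2 * math.sqrt(p))
    return 6 * t_startup + (4 * packets(diag + 1, Ps) + 8 * packets(diag, Ps)) * t_comm

def hexagon_on_square(n, p, Ps, t_startup, t_comm, t_forward):
    diag = n / (2 * math.sqrt(p))
    comm = 4 * packets(diag + 1, Ps) + 4 * packets(diag, Ps) + 8 * packets(diag, Ps)
    return 6 * t_startup + comm * t_comm + 4 * t_forward

def compare(Ps, n=1024, p=16):
    """Delays in units of t_comm with startup and forwarding set to zero."""
    return (square_on_square(n, p, Ps, 0.0, 1.0),
            hexagon_on_hexagon(n, p, Ps, 0.0, 1.0),
            hexagon_on_square(n, p, Ps, 0.0, 1.0, 0.0))

large_packets = compare(Ps=256)  # e.g. 1024-byte packets of 4-byte values
small_packets = compare(Ps=4)    # e.g. 16-byte packets of 4-byte values
```

With these sample values, square partitions need the fewest packet times when packets are large, while hexagons on a hexagonal mesh need the fewest when packets are small.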

As noted earlier, similar analysis can be applied to other meshes and stencils. Table 6 
shows the number of other partitions with which each partition must communicate (i.e., the 
number of transmission startups). In addition, transmission delays are shown as a sum of terms. 
Each term is a product of the amount of data exchanged between logically adjacent partitions 
and the number of phases necessary to accomplish the exchange. In the table, the potential 
effects of packet size on transmission delay are ignored, as are the times for startup and 
forwarding. Table 6 suggests that hexagonal partitions are preferable for 5-point stencils, and 
square partitions are more appropriate for 9-point stencils, confirming our earlier
mesh-independent analysis. As we shall see, however, both the number of message startups and
the amount of data must be considered when estimating the performance of a stencil/partition/mesh
trio. 


4.3. An Evaluation of Stencils, Partitions and Meshes 

Equation (4.2), the delay to send data, includes parameters for startup, forwarding cost, 
packet size, and packet transmission time. Because our primary interest is the effect of 
transmission time, we have ignored the effects of startup and forwarding (i.e., we have assumed 
those parameters are zero). When evaluating the relative performance of stencil/partition/mesh 
trios, we have attempted to use values for packet size and packet transmission time based on 
those for commercial message passing machines. For example, the Intel iPSC [Ratt85] sends 1K
byte packets with a measured transmission time of between 6 and 7 milliseconds. 



Table 6  Message passing data exchange

                                Number of
                                Communicating                                                                Transmission
Partition  Mesh      Stencil    Partitions     Phase-Data Products                                           O(Expected Delay)

Square     Square    5-point    4              $8n/\sqrt{p}$                                                 $8n/\sqrt{p}$
Square     Square    9-point    8              $8n/\sqrt{p} + 16$                                            $8n/\sqrt{p}$
Square     Hexagon   5-point    4              $8n/\sqrt{p}$                                                 $8n/\sqrt{p}$
Square     Hexagon   9-point    8              $8n/\sqrt{p} + 12$                                            $8n/\sqrt{p}$
Hexagon    Square    5-point    6              $4[n/(2\sqrt{p}) + 1] + 4n/(2\sqrt{p}) + 8n/(2\sqrt{p})$      $8n/\sqrt{p}$
Hexagon    Square    9-point    6              $4[n/(2\sqrt{p}) + 1] + 4n/\sqrt{p} + 8n/\sqrt{p}$            $14n/\sqrt{p}$
Hexagon    Hexagon   5-point    6              $4[n/(2\sqrt{p}) + 1] + 8n/(2\sqrt{p})$                       $6n/\sqrt{p}$
Hexagon    Hexagon   9-point    6              $4[n/(2\sqrt{p}) + 1] + 8n/\sqrt{p}$                          $10n/\sqrt{p}$


Figure 4.2 shows the speedup, obtained using (4.1), of square and hexagonal partitions on
both square and hexagonal meshes, using a 5-point stencil. In the figure, 1K byte packets are
used. We see that square partitions yield significantly larger speedup than hexagonal partitions,
regardless of the underlying mesh. This is counter-intuitive and would seem to contradict Table
6. Careful inspection of (4.2), however, shows that packet size is crucial. The term

\[
\left\lceil \frac{M(S,p)}{Ps} \right\rceil l_{ij}\, t_{comm}
\]

in (4.2) accounts for the discretization overhead caused by packets. If Ps, the packet size, is
large, the number of partitions that must receive data from each partition is much more
important than the total amount of data to be sent. For example, sending 4 bytes to 6 partitions
is much more expensive than sending 6 bytes to 4 partitions if the packet size is 1024 bytes. The
former requires 6 packet transmissions, the latter only 4 transmissions. Square partitions,
because they require communication with only four neighboring partitions, are preferable to
hexagonal partitions with six neighboring partitions, even though more data must be transmitted
with square partitions.

Figure 4.2  Speedup for 5-point stencil (1024 X 1024 grid with 1024 byte packets)

Parameter                   Value
Packet size                 1024
Startup                     0.0
Forwarding                  0.0
Packet transmission         6 X 10^-3 sec
Floating point operation    1 X 10^-6 sec

With a 4 byte floating point representation, a 1024X1024 grid, 1024 byte packets, and 
square partitions (the assumptions of Figure 4.2), using more than 16 partitions will not decrease 
the communication delay because, beyond this point, the total number of packet transmissions 
does not change. Instead, the ratio of useful computation to communication begins to degrade. 
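Both packet-counting arguments above can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, and it counts the packets one square partition sends to its four neighbors, ignoring the matching receives.

```python
import math

def packets_per_exchange(neighbors, bytes_per_neighbor, packet_bytes):
    """Packets sent when each neighbor receives its own message."""
    return neighbors * math.ceil(bytes_per_neighbor / packet_bytes)

# The example from the text: many tiny messages versus fewer, slightly larger ones.
six_neighbors = packets_per_exchange(6, 4, 1024)
four_neighbors = packets_per_exchange(4, 6, 1024)

def square_packets(n, p, bytes_per_value=4, packet_bytes=1024):
    """Packets one square partition sends to its four neighbors."""
    side_values = n // math.isqrt(p)
    return packets_per_exchange(4, side_values * bytes_per_value, packet_bytes)

# Packet count for a 1024 x 1024 grid as the number of partitions grows.
counts = {p: square_packets(1024, p) for p in (4, 16, 64, 256, 1024)}
```

With 16 partitions each side holds 256 values, which is exactly one 1024-byte packet, so the count stays at four packets for every larger number of partitions.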

As the packet size decreases, we would expect the differential in amount of transmitted 
data to become more important. For small enough packets, the total amount of data accurately 
reflects the delay. Figure 4.3 shows just this result. With the smaller 16 byte packets used in the
figure, hexagonal partitions are preferred over square partitions. 

Comparing Figures 4.2 and 4.3, we also see the effects of varying the number of processors. 
For a small number of processors, the iteration is compute bound. As the processors (and 
partitions) increase, the distinction between differing partition shapes becomes apparent. With 
1024 processors, only one grid point resides in each partition, and the effects of packet size on 
performance are striking. 

Figures 4.4 and 4.5 illustrate phenomena similar to those in Figures 4.2 and 4.3. For large 
packet sizes, Figure 4.4, hexagonal partitions are preferable to square partitions because the 
hexagons communicate with only six other hexagons, rather than eight other squares. However, 
the square partitions require less interpartition data transfer. Only when the packet size 
becomes small, Figure 4.5, does the potential advantage of square partitions become apparent. 
Figure 4.3  Speedup for 5-point stencil (1024 X 1024 grid with 16 byte packets)

Parameter                   Value
Floating point operation    1 X 10^-6 sec
Stencil, partition, mesh, and hardware parameters interact in non-intuitive ways. For 5-point
stencils, square partitions, with their smaller number of communicating neighbors, are
appropriate for large packet sizes. Likewise, hexagonal partitions, with smaller interpartition
data transfer, are appropriate for small packet sizes. The reverse is true for 9-point stencils:

hexagonal partitions are appropriate for large packet sizes (even though they are 1— partitions 

6 

for the 9-point stencil), and square partitions are best for small packet sizes. The interaction of 
parameters cannot be ignored when considering the performance of an algorithm on a particular 
architecture. 

Recognizing the interdependence of parameters, Saltz et al. [Salt86] recently evaluated the
Intel iPSC for solution of the heat equation using Successive Over Relaxation (SOR). They 
observed that performance on the iPSC, with its high transmission startup cost and large 
packets, varied greatly with the size of the grid and the shape of the grid partitions. For small 
grids, horizontal strips, although requiring more interpartition data transfer, were preferable to 
square partitions. Only when the grid became large did the advantage of square partitions 
become apparent. The reasons are precisely those observed in Figures 4.2 and 4.3: amount of 
data versus number of communicating partitions. This validation of our analytic techniques 
suggests that they can effectively be used to determine the appropriate combination of partition 
shape and size given the architectural parameters of the underlying parallel machine. 
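The trade-off between strips and squares can be illustrated with a small Python sketch. It is not a model of the measurements in [Salt86]: the function name, the grid sizes, the 16 partitions, and the packet capacity of 256 values are all assumptions chosen for illustration, and only a 5-point stencil on an interior partition is considered.

```python
import math

def messages_and_packets(shape, n, p, packet_values):
    """(messages, packets) one interior partition sends per iteration."""
    if shape == "strip":
        neighbors, values = 2, n                   # two long edges of n points
    elif shape == "square":
        neighbors, values = 4, n // math.isqrt(p)  # four edges of n/sqrt(p) points
    else:
        raise ValueError(shape)
    return neighbors, neighbors * math.ceil(values / packet_values)

small_grid = {s: messages_and_packets(s, 64, 16, 256) for s in ("strip", "square")}
large_grid = {s: messages_and_packets(s, 4096, 16, 256) for s in ("strip", "square")}
```

On the small grid the strips send fewer messages and fewer packets even though they move more data. On the large grid the squares send fewer packets.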

4.4. Shared Memory Architectures

Unlike a message passing architecture where partitions exchange values via explicit 
messages, a shared memory implementation stores all partition values that must be exchanged in 
global, shared memory. The values associated with all other grid points are kept in memories 
local to each processor. 
Figure 4.4  Speedup for 9-point stencil (1024 X 1024 grid with 1024 byte packets)

Figure 4.5  Speedup for 9-point stencil (1024 X 1024 grid with 16 byte packets)
Just as for message passing, the iteration time for evaluating one partition of grid points is

\[
t_{cycle}^{processor} = t_{comp} + t_a + t_w ; \tag{4.3}
\]

only the interpretation of the access ($t_a$) and waiting ($t_w$) times differs. With a message passing
architecture, these times depend on the contention for communication links. Analogously, shared
memory delays arise from memory contention. Vrsalovic et al. [Vrsa85] observed that the
expected waiting time for memory access takes the form



\[
t_w = \begin{cases}
  \ldots & \text{synchronous} \\
  \max\{0,\ \ldots - t_{comp}\} & \text{asynchronous}
\end{cases} \tag{4.4}
\]


where C is the number of processors that can access shared memory simultaneously, and t a is the 
memory access time. The synchronous case, where all processors simultaneously attempt to 
access global memory, forces one processor to wait until all others have accessed memory. The 
length of this delay depends on the number of simultaneous memory accesses supported. If the 
processors operate asynchronously, allowing overlap of computation and memory access, the level 
of memory contention is reduced. In the simplest case, a set of global memory modules 
connected to a shared bus, the number of concurrent memory accesses C is just 1. If a multistage 
switching network connects processors and memories, (4.4) can be replaced with a waiting time 
function [Krus83]. Whatever the interconnection network, t w reflects the effects of memory 
contention. We will return to this later, but first we consider the expected amount of data 
transferred to/from shared memory. 

When considering a shared memory implementation of an iteration technique, two choices 
arise: local copies of partition boundaries or only global storage. In the first case, each partition 
not only retains a copy of its boundary values after writing them to global memory but also
copies into local memory all boundaries needed from other partitions. With only global storage, 
the boundary values are accessed in global memory each time they are needed. The performance 
of these two implementations differs considerably based on the stencil, partition, and memory 
access technique. Hence, we consider local copies and global access for both 5-point and 9-point 
stencils with rectangular strips, square, and hexagonal partitions. 

For notational convenience, we let $g_a^{\text{local copies}}$ be the number of global memory accesses for
copying boundary values to and from a partition (assuming local copies), $g_a^{\text{no copies}}$ be the number
of global memory accesses (assuming no local copies), $t_g$ be the time to access one value from
global memory, and $t_l$ be the processor overhead associated with copying one boundary value.
With this notation, the cycle time for one iteration is

\[
t_{cycle}^{\text{local copies}} = E(S)\,\frac{n^2}{p}\,T_{fp} + g_a^{\text{local copies}}\,(t_g + t_l) + t_w \tag{4.5}
\]

or

\[
t_{cycle}^{\text{no copies}} = E(S)\,\frac{n^2}{p}\,T_{fp} + g_a^{\text{no copies}}\, t_g + t_w , \tag{4.6}
\]

where the three terms correspond to those in (4.3). As with message passing implementations,
E(S) is the number of floating point operations required to update each grid point given stencil
S, p is the number of partitions, and $T_{fp}$ is the time needed for one floating point operation.
When local copies are used, some processor overhead may be required to maintain the copies
(e.g., copying from system buffers to user memory); this overhead is reflected by $t_l$. Finally, $t_g$
and $t_l$ are hardware parameters; only $g_a$ depends on the choice of local copies or global access.
Thus, we concentrate on derivations of $g_a^{\text{local copies}}$ and $g_a^{\text{no copies}}$ for selected combinations of
stencils and partitions. The results, derived below, are summarized in Table 7.
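As a sketch of how (4.5) and (4.6) might be evaluated, the Python function below computes a cycle time from an access count. Several things here are assumptions: the computation term is taken to be E(S) times the $n^2/p$ grid points in a partition, the waiting time `t_w` is supplied by the caller rather than modeled, and the sample values (10 operations per point, 64 partitions, a copy overhead equal to one floating point operation) are hypothetical. The two access counts are those quoted in the text for the 9-point stencil with square partitions.

```python
def cycle_time(flops_per_point, n, p, T_fp, accesses, t_g, t_w, t_l=0.0):
    """One-iteration time following the assumed shape of (4.5) and (4.6).

    Pass t_l > 0 with the local-copies access count for (4.5);
    leave t_l = 0 with the no-copies access count for (4.6).
    """
    t_comp = flops_per_point * (n * n / p) * T_fp
    return t_comp + accesses * (t_g + t_l) + t_w

n, p = 1024, 64
m = n // 8                   # n / sqrt(p) for p = 64
T_fp, t_g = 1e-6, 5e-6       # global access costs five floating point operations

local = cycle_time(10, n, p, T_fp, accesses=8 * m, t_g=t_g, t_w=0.0, t_l=1e-6)
no_copies = cycle_time(10, n, p, T_fp, accesses=40 * m + 8, t_g=t_g, t_w=0.0)
```

Even with contention ignored (`t_w = 0`), the no-copies variant is slower in this example because its access count is roughly five times larger.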
Table 7  Shared memory data transfers

                              Rectangular
Stencil    Quantity           Strip          Square                              Hexagon

5-point    $g_a^{local}$      $4(n-2)$       $8n/\sqrt{p} - 4$                   $6n/\sqrt{p}$
           $g_a^{global}$     $12(n-2)$      $24n/\sqrt{p}$ [remainder illegible]   [illegible]

9-point    $g_a^{local}$      $4(n-2)$       $8n/\sqrt{p}$                       [illegible]
           $g_a^{global}$     $20(n-2)$      $40n/\sqrt{p} + 8$                  $48n/\sqrt{p} - 44$


The values of $g_a^{\text{local copies}}$ for both the 5-point and 9-point stencils can be easily determined
from Figures 2.2, 2.4, and 2.7. For example, Figure 2.2(b) (with $r = \sqrt{p}$) shows that a square
partition reads $n/\sqrt{p}$ global memory values from each of its four neighbors and writes its
own four boundaries back to global memory.^5 The total number of global memory accesses is
then

\[
4\,\frac{n}{\sqrt{p}} + 4\left(\frac{n}{\sqrt{p}} - 1\right) = \frac{8n}{\sqrt{p}} - 4 .
\]

The 9-point stencil is similar, requiring four extra boundary values, one from each of the
diagonally adjacent partitions.

^5 The four corner points are written only once.
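The count for square partitions with local copies can be verified by enumerating the boundary ring directly. The Python sketch below does this for a partition with m points per side, where m stands for $n/\sqrt{p}$. The function name is hypothetical.

```python
def local_copy_accesses(m):
    """Global memory accesses per iteration for an m x m square partition (5-point stencil)."""
    # Reads: one full edge of m values from each of the four neighbors.
    reads = 4 * m
    # Writes: every point on the partition's own boundary ring, each written once,
    # so the four corner points are not counted twice.
    ring = {(i, j) for i in range(m) for j in range(m)
            if i in (0, m - 1) or j in (0, m - 1)}
    return reads + len(ring)

accesses_32 = local_copy_accesses(32)
```

The ring holds 4(m - 1) points, so the total agrees with the closed form 8m - 4 for every m of at least 2.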
The situation changes dramatically if no local copies of boundary data are maintained.
Boundary values are often used multiple times during an iteration. With a 5-point stencil and
square partitions, updating a single element on the boundary generally requires access to three
values on the partition's boundary and one access to another partition's boundary. The updated
boundary element must then be rewritten. Hence, five memory accesses are required if no copies
are maintained.

The penalty for not maintaining local copies is even more striking for 9-point stencils.
Figure 4.6 shows the number of global memory accesses for each point in an interior square
partition when the 9-point stencil is used and no local copies are maintained. The numbers are
determined by counting boundary grid points needed to update the value at each grid point. If
the grid point lies on the boundary, the count is increased by two: the old value must be read
from global memory and the new value written. Obtaining a general formula for the number of
memory accesses is straightforward, given diagrams such as Figure 4.6. For a 9-point stencil
with square partitions, $40n/\sqrt{p} + 8$ memory accesses are needed, a five-fold increase over that when
local copies are maintained.

Figure 4.6  Global memory accesses for 9-point stencil (square partitions with no local copies)
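The five-fold figure can be checked by comparing the two access counts quoted for the 9-point stencil with square partitions. The Python sketch below uses a 1024 X 1024 grid. The function name and the chosen processor counts are illustrative.

```python
import math

def nine_point_square_accesses(n, p):
    """(local copies, no copies) global memory accesses for one square partition."""
    m = n / math.sqrt(p)
    return 8 * m, 40 * m + 8

ratios = {}
for p in (4, 16, 64, 256):
    local, no_copies = nine_point_square_accesses(1024, p)
    ratios[p] = no_copies / local
```

The ratio is 5 plus a small correction of $\sqrt{p}/n$, so it stays close to five as long as each partition holds many points per side.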

4.5. Shared Memory Analysis

Given an analysis of the memory traffic required for maintaining local copies or always 
accessing shared memory, only hardware parameters and some assumption about the underlying 
interconnection network are needed to predict performance. As noted earlier, the memory 
contention function t w can reflect a variety of interconnection strategies ranging from a single 
global memory bus to a multistage interconnection network. Because the importance of local 
copies is most striking when memory contention is severe, we have concentrated on the worst 
case: a single bus connecting all global memories and processors. 

Figure 4.7 illustrates the speedup obtained for square and hexagonal partitions on a 5-point 
stencil with varying numbers of processors and local copies of partition boundaries. In the figure, 
access to global memory is assumed to take five times as long as a single floating point
operation. The figure confirms the analysis of Tables 6 and 7: hexagons are the preferred
partition type. For small numbers of processors, computation time dominates, and there is little 
distinction between the partition types. However, as the number of processors increases, the 
smaller interpartition data transfer required by hexagons makes their use attractive. Speedup 
increases with the number of processors until the global memory bus becomes a bottleneck; at 
that point speedup begins to decline. 
Figure 4.7  Shared memory speedup for 5-point stencil (local copies with grid size 1024 X 1024)

Figure 4.8  Shared memory speedup for 9-point stencil (local copies with grid size 1024 X 1024)

Figure 4.8 shows a result similar to Figure 4.7, except for 9-point stencils. As Table 7
suggests, square partitions are preferred. A comparison of Figures 4.7 and 4.8 shows that the 9- 
point stencil gives larger absolute speedup. The reason is intuitive: the greater computation cost 
at each grid point more than offsets the increased communication cost for the 9-point stencil. 
Hence, an equal number of processors translates into a greater speedup. 

Finally, Figure 4.9 compares maintaining local copies of boundaries to continued access to 
global memory. This figure also confirms what Table 7 suggests; local copies are clearly 
advantageous. The argument for storing boundaries in local memories is compelling. Without 
such copies, the bandwidth of the global bus quickly saturates. 

5. Conclusions 

The trio of iteration stencil, grid partition shape, and underlying parallel architecture must 
be considered together when designing parallel algorithms for solution of elliptic partial 
differential equations. Isolated evaluation of one or even two components of the trio is likely to 
yield non-optimal algorithms. 

We have seen, for example, that an abstract analysis of iteration stencil and partition shape
suggests that hexagonal partitions are best for 5-point stencils, whereas square partitions are
best for 9-point stencils. Further analysis shows that this is only true in a message passing
implementation if small packets are supported. For large packets, the reverse is true (i.e., square
partitions for 5-point stencils and hexagonal partitions for 9-point stencils). Likewise, the type
of interconnection network is crucial. Mapping grid partitions onto a network that does not
directly support the interpartition communication pattern markedly degrades performance.
Finally, when considering shared memory implementation of the iterations, maintaining local
copies of the partition boundaries is imperative. Without local copies, or an extremely fast
interconnection network, the observed speedups are extremely small. Consequently, only a small
number of processors can be used effectively.

Figure 4.9  Speedup for 9-point stencil with grid size 1024 X 1024

In summary, stencil, partition shape, and architecture must be considered in concert when 
designing an iterative solution algorithm. They interact in non-intuitive ways and ignoring one 
or more of the three almost certainly leads to sub-optimal performance. 